Main analysis (Exploratory Data Analysis)
1. Analysis on Age distribution of our customers
df_2month = full_join(summary_18_6, summary_17_6, by = "factor_year")
## Warning: Column `factor_year` joining factors with different levels,
## coercing to character vector
df_3month = full_join(df_2month, summary_16_6, by = "factor_year")
## Warning: Column `factor_year` joining character vector and factor, coercing
## into character vector
t1 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.x, time = df_3month$time.x)
t2 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.y, time = df_3month$time.y)
t3 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq, time = df_3month$time)
tf = rbind(t1,t2,t3)
ggplot(tf, aes(x = factor_year, y = Freq, color = as.factor(time))) + geom_point() +
coord_flip() + xlab("user's birth year") +
ggtitle("Frequency of Citi Bike users' birth year in different months")+
labs(fill="Time")
## Warning: Removed 39 rows containing missing values (geom_point).
This Cleveland Dot Plot shows the distribution of user’s birth year in 2016/6, 2017/6 and 2018/6. They are all left-skewed, which is what we expect. It is glad to see number of users for all ages increases from 2016 to 2018. In 2018 June, the active users are aged between 24 and 37.
set.seed(321)
index = sample(nrow(df), round(nrow(df)*0.001))
sample = df[index,]
g = ggplot(sample, aes(birth.year,minute))
g + geom_point(alpha = 0.5, col = "#fb6a4a") + ylab("trip duration (in min)") +
xlab("users' birth year") + ggtitle("Scatterplot of uses' birth year and trip duration (in sec) in Jun 2018")
There are 1.945,611 observations in our data set, so we sampled 1% of them to see the relationship between user’s birth of year and trip durations.
We were expect to see an increasing and then decreasing trend as in the distribution of users’ birth year (right-skewed). However, there is no obvious trend in the scatterplot. Again, since NA birth year were transfered to 1969, there are a lot observations in 1969 and its covers a wide range of trip durations, which is misleading.
2. Analysis on Gender of our customers
gender = data.frame(group = c("unkown", "male", "female"),
value = summary(as.factor(df$gender)))
pie = ggplot(gender, aes(x="", y = value, fill=group)) +
geom_bar(width = 1, stat="identity") +
coord_polar("y", start = 0)
blank_theme <- theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=14, face="bold")
)
pie + scale_fill_manual(values=c("#fbb4ae", "#b3cde3", "#ccebc5")) +
blank_theme + theme(axis.text.x=element_blank()) +
geom_text(aes(y = value/3 + c(0, cumsum(value)[-length(value)]),
label = c("10%","66%", "24%")), size=5) +
ggtitle("Ratio of users' gender of Citi Bike ")
There are 66% male users, 24% female users and 10% unkown. Apprently, there is a big differnece betwee female users and male users. We have seen some news report that Citi Bike users sometimes complain about the weight of citibike. Maybe that is one of the reasons that female users are small. Thus, we can think of some strategies for female users to increase profits because there is indeed room for improvement.
3. Analysis on Station Distribution and Corresponding Number of Bikes
Below is the heatmap of number of trips for each station. We also label the top 10 popular stations
## get station info
station.info <- data_18_6 %>%
group_by(start.station.id) %>%
summarise(lat=as.numeric(start.station.latitude[1]),
long=as.numeric(start.station.longitude[1]),
name=start.station.name[1],
n.trips=n())
pal = colorNumeric(
palette="YlOrRd",
domain = station.info$n.trips
)
top10 = station.info %>% dplyr::arrange(desc(n.trips))
icon <- makeIcon(
iconUrl = "citi_bike_icon.png",
iconWidth = 20, iconHeight = 20,
iconAnchorX = 22, iconAnchorY = 22,
shadowWidth = 20, shadowHeight = 20,
shadowAnchorX = 4, shadowAnchorY = 62
)
leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
leaflet::addProviderTiles("CartoDB.Positron") %>%
leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
leaflet::addLegend("bottomleft", pal = pal, value = ~n.trips, title = "Num of trips starting from this station") %>%
addMarkers(lng = ~top10[1:10,]$long, lat = ~top10[1:10,]$lat, icon = icon)
We can see that there are more users starting their trips in Manhattan. In Manhattan, Midtown is the most popular area. The station near Central Park is also a popular spot, probably because there are some tourists riding inside Central Park. Back to our first question in the introduction, we are wondering why we do not see a lot of people around campus riding with Citi Bike. The number of stations around campus is not limited, but the starting rides from those stops are small. One possible explanation we believe is that there are quite amount of students living in dorms or very close to campus. So they do not need bicycles to transport.
4. Analysis on weather and season
Join weather data to see if weather has an impact
Is it reasonable to assume that weather and season has a huge impact on CitiBike frequency.
#NYCweather<-read.csv("./data/NYCweather.csv")
NYCweather<-read.csv("NYCweather.csv")
df$Date <- as.Date(df$starttime)
NYCweather$Date<-as.Date(NYCweather$Date)
df_weather <- inner_join(df,NYCweather,by="Date")
df_weather$Severe <- as.factor(df_weather$Severe)
hist(df_weather$Date, breaks="days", freq=TRUE)
ggplot(df_weather, aes(x=Date))+
geom_histogram(colour="white", fill="steelblue")+
ggtitle("Number of trips by day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df_weather, aes(x=Severe, y=))+
geom_bar(colour="white", fill="steelblue")+
ggtitle("Number of trips per day for severe and unsevere weather condition")
ggplot(data = df_weather,
aes(x = Windspeed, y = minute)) +
scale_x_continuous("Windspeed") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Windspeed") +
stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1)
ggplot(data = df_weather,
aes(x = Temperature, y = minute)) +
scale_x_continuous("Temperature") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Temperature") +
stat_summary(fun.y="mean", geom="line", colour="steelblue", size=1)
ggplot(data = df_weather,
aes(x = Precipitation, y = minute)) +
scale_x_continuous("Precipitation") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Precipitation") +
stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1)
ggplot(data = df_weather,
aes(x=factor(Severe), y=minute))+
xlab("Severity of weather")+ylab("Average trip length")+
stat_summary(fun.y="mean", geom="bar")
df_by_temp <- df_weather %>% group_by(Temperature) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_temp, aes(x=as.numeric(Temperature), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="steelblue")+
xlab("Temperature")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different temperatures")
## No summary function supplied, defaulting to `mean_se()
df_by_windspeed <- df_weather %>% group_by(Windspeed) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_windspeed, aes(x=as.numeric(Windspeed), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="steelblue")+
xlab("Windspeed")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different windspeeds")
## No summary function supplied, defaulting to `mean_se()
df_by_precip <- df_weather %>% group_by(Precipitation) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_precip, aes(x=as.numeric(Precipitation), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="steelblue")+
xlab("Precipitation")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different precipitations")
## No summary function supplied, defaulting to `mean_se()
Interactive time series to see if season has an impact on customer flow with Dygraph
We combine over a year time series to see if season has an impact on customer flow with Dygraph. The data is from 201706 to 201806.
library(dygraphs)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
##
## addLegend
## The following objects are masked from 'package:dplyr':
##
## first, last
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:memisc':
##
## is.interval
## The following object is masked from 'package:base':
##
## date
library(zoo)
#df_timeseries <- read.csv('./data/data_2017_2018.csv')
df_timeseries <- read.csv('data_2017_2018.csv')
a <- lubridate::mdy_hms(df_timeseries$starttime)
## Warning: All formats failed to parse. No formats found.
b <- as.Date(df_timeseries$starttime)
b[is.na(b)] <- a[!is.na(a)]
df_timeseries$starttime <- b
df_timeseries <- df_timeseries %>% mutate(time = format(as.Date(starttime), "%Y-%m"))
df_timeseries <- df_timeseries[,c("time","usertype")]
df_subscriber <- df_timeseries %>% subset(usertype == 'Subscriber')
df_customer <- df_timeseries %>% subset(usertype == 'Customer')
df_subscriber <- df_subscriber %>% group_by(time) %>% summarise(Number = n())
df_customer <- df_customer %>% group_by(time) %>% summarise(Number = n())
ts_subscriber <- xts::xts(df_subscriber$Number,as.Date(as.yearmon(df_subscriber$time)))
ts_customer <- xts::xts(df_customer$Number,as.Date(as.yearmon(df_customer$time)))
ts <- cbind(ts_subscriber,ts_customer)
dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber ') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))
5. Analysis about customer flow
library(lubridate)
df_timeslot <- mutate(df,hour = hour(as.POSIXct(starttime)))
df_timeslot <- mutate(df_timeslot,wday = wday(as.POSIXct(starttime)))
df_timeslot <- df_timeslot[,c("hour","wday","minute","start.station.name","usertype")]
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by usertype')
timeslot + facet_wrap(~usertype)
From the above analysis, the overall time slot has two peaks: in 8-9 am and 5-6 pm, which is reasonable since they are the rush hour, which has a large number of passenger flow.
However, the time slot distribution differs a lot with aspect to different usertype and different weekdays.
Take the above images as examples, the customer has a higher peak in the time slot which is around 4 pm while the subscribers has two peaks.
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by weekdays')
timeslot + facet_wrap(~as.factor(wday))
Also, the weekdays play a big role in the time duration distribution. On weekdays, we still sees similar two peaks. However, on weekends, we can only observe one peak. Thus, we can think some specific pricing strategy for a specific time slot in the weekdays.